demonstration data
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
Pre-trained vision-language models (VLMs), though powerful, typically lack training on decision-centric data, rendering them sub-optimal for decision-making tasks such as in-the-wild device control through graphical user interfaces (GUIs) when used off-the-shelf. While training with static demonstrations has shown some promise, we show that such methods fall short for controlling real GUIs because they fail to handle the real-world stochasticity and dynamism not captured in static observational data. This paper introduces a novel autonomous RL approach, called DigiRL, for training in-the-wild device-control agents by fine-tuning a pre-trained VLM in two stages: offline RL and offline-to-online RL. We first build a scalable and parallelizable Android learning environment equipped with a VLM-based general-purpose evaluator, and then identify the key design choices for simple and effective RL in this domain. We demonstrate the effectiveness of DigiRL on the Android-in-the-Wild (AitW) dataset, where our 1.5B VLM trained with RL achieves a 49.5% absolute improvement in success rate (from 17.7% to 67.2%) over supervised fine-tuning with static human demonstration data. Notably, this improvement is achieved without any additional supervision or demonstration data. These results significantly surpass not only the prior best agents, including AppAgent with GPT-4V (8.3% success rate) and the 17B CogAgent trained with AitW data (14.4%), but also the prior best autonomous RL approach based on filtered behavior cloning.
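To make the two-stage recipe concrete, the sketch below runs offline RL on static logged trajectories and then offline-to-online RL on the agent's own evaluator-scored rollouts. Everything here is an illustrative stand-in, not DigiRL's implementation: a toy one-step task replaces the Android environment, a binary success signal replaces the VLM evaluator, and a tabular softmax policy replaces the fine-tuned VLM.

```python
import math
import random

class ToyDeviceEnv:
    """Toy one-step stand-in for the Android environment: exactly one
    of four actions completes the task; the 0/1 reward plays the role
    of the VLM-based success evaluator."""
    TARGET = 2

    def reset(self):
        return "screenshot+instruction"        # observation placeholder

    def step(self, action):
        return None, float(action == self.TARGET), True   # obs, reward, done

class ToyPolicy:
    """Tabular softmax policy standing in for the fine-tuned VLM."""
    def __init__(self, n_actions=4):
        self.prefs = [0.0] * n_actions

    def act(self, obs):
        weights = [math.exp(p) for p in self.prefs]
        return random.choices(range(len(weights)), weights=weights)[0]

    def update(self, action, advantage, lr=0.5):
        # Advantage-weighted update: reinforce actions that beat the baseline.
        self.prefs[action] += lr * advantage

env, policy = ToyDeviceEnv(), ToyPolicy()
BASELINE = 0.25   # mean reward of a uniform-random behavior policy (1 in 4)

# Stage 1 (offline RL): learn only from static logged trajectories.
logged = []
for _ in range(200):
    env.reset()
    a = random.randrange(4)                   # random behavior policy
    _, r, _ = env.step(a)
    logged.append((a, r))
for a, r in logged:
    policy.update(a, r - BASELINE)

# Stage 2 (offline-to-online RL): keep improving on the agent's own
# evaluator-scored rollouts; no extra demonstrations are needed.
for _ in range(200):
    obs = env.reset()
    a = policy.act(obs)
    _, r, _ = env.step(a)
    policy.update(a, r - BASELINE)

print("action preferences:", [round(p, 2) for p in policy.prefs])  # TARGET dominates
```

The point of the toy is the structure, not the algorithm: stage 1 squeezes what it can from fixed data, and stage 2 turns autonomous, evaluator-scored interaction into further improvement without any new human supervision.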
SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation
Chen, Qianzhong, Yu, Justin, Schwager, Mac, Abbeel, Pieter, Shentu, Yide, Wu, Philipp
Large-scale robot learning has recently shown promise in enabling robots to perform complex tasks by integrating perception, control, and, optionally, language understanding into a unified framework. However, such systems continue to struggle with long-horizon, contact-rich manipulation tasks, such as the handling of deformable objects, where supervision from demonstrations is often inconsistent in quality. In such settings, reward modeling offers a natural solution: by providing grounded progress signals, it can transform noisy demonstrations into stable supervision that generalizes across diverse trajectories. In this work, we introduce a stage-aware, video-based reward modeling framework that jointly predicts the high-level task stage and the fine-grained progress within each stage. Reward labels are automatically derived from natural-language subtask annotations, enabling consistent progress estimation across variable-length and heterogeneous demonstrations. This design overcomes the limitations of frame-index-based labeling, which collapses in long, variable-duration tasks such as folding a T-shirt. Our reward model demonstrates robustness to demonstration variability, generalization to out-of-distribution scenarios, and strong utility for downstream policy training. Building on this reward model, we propose the Reward-Aligned Behavior Cloning (RA-BC) framework, which selectively filters high-quality data and reweights training samples according to reward estimates. Extensive experiments demonstrate that the reward model outperforms baselines on out-of-distribution real-robot policy rollouts and on human demonstration validation. Our approach achieves 83% success on folding T-shirts from the flattened state and 67% from the crumpled state, dramatically surpassing vanilla behavior cloning, which attains only 8% and 0% success, respectively, on the same training dataset. Overall, our results highlight reward modeling as a key enabler for scalable, annotation-efficient, and robust imitation learning in long-horizon robotic manipulation.

The long-standing vision of enabling robots to seamlessly assist humans in household chores has inspired decades of research in robotics. From tidying living spaces to preparing meals, such capabilities hold the promise of freeing up human time and improving quality of life.
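As a rough sketch of the ideas above (not the paper's exact formulation), the snippet below first fuses a predicted stage index and within-stage progress into a scalar reward, then converts per-sample reward estimates into RA-BC-style training weights by filtering out low-scoring samples and softly upweighting the rest. The fusion rule, threshold, and temperature are assumptions made for the example.

```python
import math

def stage_aware_reward(stage_idx, stage_progress, num_stages):
    """One plausible way to fuse a predicted stage index and the
    fine-grained progress within that stage into a scalar in [0, 1]."""
    return (stage_idx + stage_progress) / num_stages

def ra_bc_weights(scores, keep_threshold=0.3, temperature=0.5):
    """Reward-aligned BC weights: drop samples scored below the
    threshold, softmax-weight the survivors, and normalize so the
    kept weights average to 1."""
    kept = [(i, s) for i, s in enumerate(scores) if s >= keep_threshold]
    if not kept:
        return {}
    z = [math.exp(s / temperature) for _, s in kept]
    total = sum(z)
    return {i: len(z) * w / total for (i, _), w in zip(kept, z)}

# Example: folding a T-shirt annotated as 4 subtask stages; a frame
# halfway through stage 2 (0-indexed) maps to reward 0.625.
print(stage_aware_reward(stage_idx=2, stage_progress=0.5, num_stages=4))

# Five demonstration snippets scored by the reward model: the two
# low-quality ones are filtered out, the rest are reweighted.
print(ra_bc_weights([0.9, 0.1, 0.6, 0.4, 0.05]))

# During training, each kept sample's BC loss is scaled by its weight:
#   loss = sum_i w[i] * cross_entropy(policy(obs_i), action_i)
```

Deriving the reward from stage plus within-stage progress, rather than from a raw frame index, is what keeps the signal comparable across demonstrations of very different lengths.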
Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only
Zhang, Qingru, Qiu, Liang, Hong, Ilgee, Xu, Zhenghao, Liu, Tianyi, Li, Shiyang, Zhang, Rongzhi, Li, Zheng, Li, Lihong, Yin, Bing, Zhang, Chao, Chen, Jianshu, Jiang, Haoming, Zhao, Tuo
Supervised fine-tuning (SFT) has emerged as a crucial method for aligning large language models (LLMs) with human-annotated demonstrations. However, SFT, being an off-policy approach similar to behavior cloning, often struggles with overfitting and poor out-of-domain generalization, especially in limited-data scenarios. To address these limitations, we propose Self-Rewarding PPO, a novel fine-tuning method that leverages on-policy techniques to enhance generalization performance. Our approach combines the strengths of SFT and proximal policy optimization (PPO) to achieve more effective alignment from demonstration data. At its core is a reward function designed as the log policy ratio between the SFT model and the pretrained base model. This function serves as an implicit reward signal, using the pretrained policy as a baseline and the SFT policy as a target. By doing so, it enables on-policy fine-tuning without relying on human preference annotations. The integration of this self-rewarding mechanism with PPO addresses key limitations of SFT, improving generalization, data efficiency, and robustness. Our empirical evaluation across a range of natural language processing tasks demonstrates that Self-Rewarding PPO consistently outperforms traditional SFT methods. The results highlight the effectiveness of our approach in aligning LLMs using demonstration data, particularly in scenarios where high-quality annotated data is scarce.
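Concretely, the implicit reward is the log policy ratio r(x, y) = log π_SFT(y | x) − log π_base(y | x), summed over the tokens of a sampled response. The toy sketch below computes it with hand-made per-token probabilities standing in for the two language models; the token distributions are fabricated purely for illustration.

```python
import math

# Toy per-token probabilities standing in for the two language models
# (context dependence elided); in practice these come from full LM
# forward passes over the same response tokens.
SFT_LM  = {"hello": 0.40, "world": 0.35, "!": 0.25}
BASE_LM = {"hello": 0.25, "world": 0.25, "!": 0.50}

def self_reward(tokens):
    """Implicit PPO reward: log pi_sft(y|x) - log pi_base(y|x),
    accumulated per token over the sampled response."""
    return sum(math.log(SFT_LM[t]) - math.log(BASE_LM[t]) for t in tokens)

response = ["hello", "world", "!"]
print(round(self_reward(response), 3))
# A positive value means the SFT policy assigns this response more
# probability than the base model, so PPO pushes the policy toward
# SFT-like behavior with no human preference annotations at all.
```

Because the reward is just a log-ratio between two frozen models, it requires no reward-model training and no preference labels, which is what lets PPO run from demonstrations alone.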